Pseudo In-Domain Data Selection from Large-Scale Web Corpus for Spoken Language Translation

Authors

  • Shixiang Lu
  • Xingyuan Peng
  • Zhenbiao Chen
  • Bo Xu
Abstract

This paper explores efficient domain adaptation for statistical machine translation, based on extracting sentence pairs (pseudo in-domain subcorpora, those most relevant to the in-domain corpora) from a large-scale general-domain bilingual web corpus. These sentences are selected by our proposed unsupervised phrase-based data selection model. Compared with traditional bag-of-words models, our phrase-based data selection model is more effective because it captures contextual information by modeling the selection of a phrase as a whole, rather than the selection of single words in isolation. These pseudo in-domain subcorpora can then be used to train a small domain-adapted spoken language translation system that outperforms the system trained on the entire corpus, with an increase of 1.6 BLEU points. Performance is further improved when we use these pseudo in-domain corpora/models in combination with the true in-domain corpus/model, with increases of 4.5 and 3.9 BLEU points over the single in-domain and general-domain baseline systems, respectively.
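
The contrast between bag-of-words and phrase-based selection can be illustrated with a minimal sketch. This is not the authors' model, which the abstract does not specify; it assumes a simple overlap score (the fraction of a sentence's n-grams that also occur in the in-domain corpus) and whitespace tokenization. All function names and the sample sentences are hypothetical. Setting `max_n=1` gives a bag-of-words scorer, while `max_n=3` also rewards matching multi-word phrases.

```python
def ngrams(tokens, max_n):
    """Yield every contiguous n-gram of length 1..max_n as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_phrase_set(in_domain_sentences, max_n):
    """Collect all n-grams (up to max_n) seen in the in-domain corpus."""
    phrases = set()
    for sentence in in_domain_sentences:
        phrases.update(ngrams(sentence.lower().split(), max_n))
    return phrases


def relevance(sentence, phrase_set, max_n):
    """Assumed score: share of the sentence's n-grams found in-domain."""
    grams = list(ngrams(sentence.lower().split(), max_n))
    if not grams:
        return 0.0
    return sum(g in phrase_set for g in grams) / len(grams)


def select_pseudo_in_domain(general, in_domain, top_k, max_n=3):
    """Return the top_k general-domain sentences ranked by relevance."""
    phrase_set = build_phrase_set(in_domain, max_n)
    ranked = sorted(
        general,
        key=lambda s: relevance(s, phrase_set, max_n),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical toy data, for illustration only.
in_domain = ["where is the train station", "how much is the ticket"]
general = [
    "the station is where the train stops",
    "where is the train station located",
    "stock markets fell sharply today",
]

bag_of_words_pick = select_pseudo_in_domain(general, in_domain, top_k=1, max_n=1)
phrase_based_pick = select_pseudo_in_domain(general, in_domain, top_k=1, max_n=3)
print(bag_of_words_pick)
print(phrase_based_pick)
```

On this toy data the two scorers pick different sentences: the unigram scorer favors the sentence that merely reuses in-domain words, while the phrase-aware scorer favors the one that preserves in-domain word order. For a bilingual corpus, a sketch like this would score one side of each sentence pair and keep the pair.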

Similar resources

Multilingual Mobile-Phone Translation Services for World Travelers

This demonstration introduces two new multilingual translation services for mobile phones. The first translation service provides state-of-the-art text-to-text translations of Japanese as well as English conversational spoken language in the travel domain into 17 languages using statistical machine translation technologies trained automatically from a large-scale multilingual corpus. The second...

Pseudo-morpheme and Confusion Network Based Korean-english Statistical Spoken Language Translation System

In this demonstration, we present POSSLT (POSTECH Spoken Language Translation) for a Korean-English statistical spoken language translation (SLT) system using pseudo-morpheme and confusion network (CN) based technique. Like most other SLT systems, automatic speech recognition (ASR) and machine translation (MT) are coupled in a cascading manner in our SLT system. We used confusion network based ...

A bootstrapping approach for developing language model of new spoken dialogue systems by selecting web texts

This paper proposes a bootstrapping method of constructing statistical language models for new spoken dialogue systems by collecting and selecting sentences from the World Wide Web (WWW). To make effective search queries that cover the target domain in full detail, we exploit the document set described about the target domain as seeding data. An important issue is how to filter the retrieved We...

A rank-predicted pseudo-greedy approach to efficient text selection from large-scale corpus for maximum coverage of target units

Selecting efficiently a minimum amount of text from a large-scale text corpus to achieve a maximum coverage of certain units is an important problem in the spoken language processing area. In this paper, the above text selection problem is first formulated as a maximum coverage problem with a Knapsack constraint (MCK). An efficient rank-predicted pseudo-greedy approach is then proposed to solve this...

Improving Low-Resource Neural Machine Translation with Filtered Pseudo-Parallel Corpus

Large-scale parallel corpora are indispensable to train highly accurate machine translators. However, manually constructed large-scale parallel corpora are not freely available in many language pairs. In previous studies, training data have been expanded using a pseudo-parallel corpus obtained using machine translation of the monolingual corpus in the target language. However, in low-resource lan...

Publication date: 2013